We present the current status of the apeNEXT project. The aim of this project is to develop the next
generation of APE machines, which will provide multi-teraflop computing power. Like previous machines, apeNEXT
is based on a custom-designed processor that is specifically optimized for simulating QCD. We discuss the
machine design, report on benchmarks, and give an overview of the status of the software development.


1. INTRODUCTION 

The apeNEXT project was initiated [1] with the goal of building
supercomputers with a peak performance of more than 5 TFlops and a
sustained efficiency of O(50%) for key lattice gauge theory kernels.
Aiming for both large-scale simulations with dynamical fermions and
quenched calculations on very large lattices, the architecture should
allow for large on-line data storage and input/output channels that
sustain O(0.5) MByte per second per GFlops. Finally, the programming
environment should allow smooth migration from older APE systems,
i.e. support the TAO language, and provide the C language with
comparable performance.

Although there are a number of similarities between the architecture
of apeNEXT and former generations of APE supercomputers, several
design challenges had to be solved in order to meet the machine
specifications outlined above. For apeNEXT all processor
functionalities, including the network devices, were to be integrated
into one single custom chip running at a clock frequency of 200 MHz.
Unlike former machines, the nodes will run asynchronously, which
means that apeNEXT follows the single program multiple data (SPMD)
programming model.

2. PROCESSOR AND GLOBAL DESIGN 

The apeNEXT processor is a 64-bit architecture. Its arithmetic unit
can at each clock cycle perform the APE normal operation a × b + c,
where a, b, and c are IEEE double precision complex numbers. The peak
performance of each node is therefore 1.6 GFlops. Like previous APE
computers, apeNEXT provides a very large register file of
256 (64+64)-bit registers. Selected details of the processor are
shown in Fig. 1.

Figure 1. The apeNEXT processor.
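The quoted peak performance follows from counting real floating point operations in one complex normal operation. The short sketch below only reproduces that arithmetic; the function and constant names are illustrative and not part of any apeNEXT software.

```python
# Back-of-envelope check of the quoted peak performance per node.
CLOCK_HZ = 200e6  # clock frequency quoted in the text

# One complex normal operation a*b + c in terms of real arithmetic:
#   complex multiply: 4 real multiplications + 2 real additions
#   complex add:      2 real additions
FLOPS_PER_NORMAL_OP = 4 + 2 + 2


def peak_gflops(clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_NORMAL_OP):
    """Peak rate in GFlops, assuming one normal operation per clock cycle."""
    return clock_hz * flops_per_cycle / 1e9


print(f"peak performance per node: {peak_gflops():.1f} GFlops")
```

Eight real operations per cycle at 200 MHz give the 1.6 GFlops stated above.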

The memory interface of apeNEXT supports DDR-SDRAM from 256 MByte up
to 1 GByte. The memory is used to store both data and program
instructions. Conflicts between data and instruction load operations
are therefore likely. These could easily become significant since
apeNEXT is a microcoded architecture controlled by 128-bit very long
instruction words. Two strategies have been employed to avoid these
conflicts. First, the hardware supports compression of the microcode.
The compression rate depends on the level of optimization and is
typically in the range of 40-70%. Second, an instruction buffer
allows pre-fetching of (compressed) instructions. Under software
control, a section of the instruction buffer can be used to store
performance-critical kernels for repeated execution.
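To see why instruction fetches can compete with data loads, the sketch below estimates the share of memory bandwidth an instruction stream would take. It uses the 3.2 GByte/sec memory bandwidth from Table 1 and makes two assumptions that are not stated in the text: one instruction word is issued per clock cycle, and the compression rate is read as the fraction of code size saved. Reuse of kernels held in the instruction buffer is ignored.

```python
# Rough estimate of memory bandwidth consumed by instruction fetch.
CLOCK_HZ = 200e6             # clock frequency from the text
MEM_BANDWIDTH = 3.2e9        # bytes per second, from Table 1
INSTR_WORD_BYTES = 128 // 8  # 128-bit very long instruction word


def instruction_share(size_saving):
    """Fraction of memory bandwidth used by instruction fetch.

    Assumptions (not from the paper): one instruction word per cycle,
    `size_saving` is the fraction of microcode size removed by
    compression, and nothing is served from the instruction buffer.
    """
    raw_rate = INSTR_WORD_BYTES * CLOCK_HZ
    return raw_rate * (1.0 - size_saving) / MEM_BANDWIDTH


for saving in (0.0, 0.4, 0.7):
    print(f"saving {saving:.0%}: {instruction_share(saving):.0%} of bandwidth")
```

Under these assumptions an uncompressed instruction stream alone would occupy the full memory bandwidth, which is consistent with the need for both compression and an instruction buffer.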

Table 1. Key machine parameters

clock frequency       200 MHz
peak performance      1.6 GFlops
memory                256-1024 MByte/node
memory bandwidth      3.2 GByte/sec
network bandwidth     0.2 GByte/sec/link
register file         512 registers
instruction buffer    4096 words

Each apeNEXT node contains seven LVDS link interfaces which allow for
concurrent send and receive operations. Once a communication request
is queued, it is executed independently of the rest of the processor,
which is a prerequisite for overlapping network and floating point
operations. Each link is able to transmit one byte per clock cycle,
i.e. the gross bandwidth is 200 MByte per second per link. Due to
protocol overhead the effective network bandwidth is < 180 MByte per
second. The network latency is O(0.1 μs) and therefore at least one
order of magnitude smaller than for today's commercial high
performance network technologies.
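The link figures can be combined into a simple latency-plus-bandwidth estimate of message transfer time. This is a minimal sketch under stated assumptions: the latency is set to exactly 0.1 μs, the effective bandwidth to the quoted bound of 180 MByte per second, and the message size (one SU(3) matrix of 9 complex doubles) is chosen only for illustration.

```python
# Simple transfer-time model for one LVDS link.
CLOCK_HZ = 200e6        # clock frequency from the text
BYTES_PER_CYCLE = 1     # one byte per clock cycle per link
EFFECTIVE_BW = 180e6    # bytes/s, the quoted bound used as an assumption
LATENCY_S = 0.1e-6      # seconds, order of magnitude quoted in the text


def gross_link_bandwidth():
    """Gross bandwidth of one link in bytes per second."""
    return BYTES_PER_CYCLE * CLOCK_HZ


def transfer_time(n_bytes, bandwidth=EFFECTIVE_BW, latency=LATENCY_S):
    """Estimated time in seconds to deliver n_bytes over one link."""
    return latency + n_bytes / bandwidth


su3_bytes = 9 * 16  # 9 complex double precision numbers
print(f"gross bandwidth: {gross_link_bandwidth() / 1e6:.0f} MByte/s")
print(f"SU(3) matrix:    {transfer_time(su3_bytes) * 1e6:.2f} us")
```

With a latency this small, even messages of a few hundred bytes are dominated by the bandwidth term rather than by the start-up cost.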

Six of these link interfaces are used for connecting each node to its
nearest neighbours within a three-dimensional network. The seventh
link of up to one node per board can be used as an I/O channel by
connecting it to an external front-end PC equipped with a custom
PCI-LVDS interface card. The number of external links and therefore
the total I/O bandwidth can be flexibly adapted to the needs of the
users. Although all nodes are connected to their nearest neighbours
only, the hardware allows routing across up to three orthogonal links
to all nodes on a cube, i.e. connecting nodes at distance
(Δx, Δy, Δz) with |Δi| ≤ 1.
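The set of nodes reachable by such routing can be enumerated directly. The sketch below is illustrative only: it assumes periodic (torus) boundaries, which the text does not state, and the function names are hypothetical.

```python
from itertools import product


def reachable_nodes(node, dims):
    """Nodes reachable with at most one hop along each axis.

    Assumption (not stated in the text): the three-dimensional network
    wraps around periodically in every direction.
    """
    x, y, z = node
    nx, ny, nz = dims
    result = set()
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue  # skip the node itself
        result.add(((x + dx) % nx, (y + dy) % ny, (z + dz) % nz))
    result.discard(node)  # relevant only for extents smaller than 2
    return result


def link_hops(offset):
    """Number of orthogonal links crossed for a given offset."""
    return sum(abs(d) for d in offset)


neighbours = reachable_nodes((0, 0, 0), (4, 4, 4))
print(len(neighbours), "nodes reachable;",
      "corner offset crosses", link_hops((1, -1, 1)), "links")
```

On a lattice with at least three nodes per direction this gives 26 partners: 6 at one hop, 12 at two hops, and 8 at three hops.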

Although the network bandwidth is large compared to other network
technologies, it is significantly smaller than the local memory
bandwidth. It is therefore mandatory to support efficient mechanisms
for data pre-fetching. For this purpose a set of pre-fetch queues is
provided. Pre-fetch instructions in a user program trigger the memory
controller and, in the case of remote data, the network to move the
requested data into the queues. At a later stage of program execution
this data is loaded from the queues into the register file in the
same order in which the pre-fetch instructions were issued. Only if
the data is not available at that point will the processor be halted
until the data has arrived.
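The pre-fetch mechanism described above behaves like a first-in, first-out queue with a stall on late data. The following toy model is a sketch of that behaviour only; the class, its methods, and all cycle counts are hypothetical and do not describe the actual hardware interface.

```python
from collections import deque


class PrefetchQueue:
    """Toy FIFO model of a pre-fetch queue (all timings hypothetical)."""

    def __init__(self):
        self._pending = deque()  # entries: (arrival_cycle, value)
        self.stall_cycles = 0

    def prefetch(self, now, latency, value):
        """Issue a request at cycle `now`; data arrives `latency` later."""
        self._pending.append((now + latency, value))

    def load(self, now):
        """Move the oldest requested item into a register.

        Returns (value, cycle at which execution continues). The
        processor stalls only if the data has not arrived yet.
        """
        arrival, value = self._pending.popleft()
        if arrival > now:
            self.stall_cycles += arrival - now
            now = arrival
        return value, now


queue = PrefetchQueue()
queue.prefetch(now=0, latency=10, value="local word")
queue.prefetch(now=1, latency=40, value="remote word")
first, t = queue.load(now=20)       # already arrived, no stall
second, t = queue.load(now=t + 5)   # still in flight, processor waits
print(first, second, "stall cycles:", queue.stall_cycles)
```

Issuing pre-fetches early enough that the data arrives before the matching load is what allows computation to hide memory and network latency.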

The global design of apeNEXT is shown in Fig. 2. There will be 16
apeNEXT processors on one processing board, and 16 boards will be
attached to one backplane. Each node is connected to a simple
I2C-link used for bootstrapping and controlling the machine.

Figure 2. Possible apeNEXT configuration with
4 boards, 2 external LVDS-links for I/O, and a
chained I2C-link for slow-control.
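The board and backplane counts translate into aggregate figures as follows. This is plain arithmetic on numbers quoted in the text; the number of backplanes needed for the 5 TFlops target is a derived estimate, not a configuration taken from the project.

```python
import math

NODES_PER_BOARD = 16
BOARDS_PER_BACKPLANE = 16
PEAK_GFLOPS_PER_NODE = 1.6


def nodes_per_backplane():
    """Number of processors attached to one backplane."""
    return NODES_PER_BOARD * BOARDS_PER_BACKPLANE


def backplane_peak_gflops():
    """Aggregate peak performance of one fully populated backplane."""
    return nodes_per_backplane() * PEAK_GFLOPS_PER_NODE


def backplanes_for(target_tflops):
    """Smallest number of backplanes whose peak reaches the target."""
    return math.ceil(target_tflops * 1000 / backplane_peak_gflops())


print(nodes_per_backplane(), "nodes per backplane,",
      f"{backplane_peak_gflops():.1f} GFlops peak")
print(backplanes_for(5), "backplanes for a 5 TFlops peak")
```

One backplane therefore holds 256 nodes with roughly 0.4 TFlops peak, so a machine exceeding 5 TFlops needs on the order of a dozen backplanes.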


3. SOFTWARE AND BENCHMARKS 

We will provide both a TAO and a C compiler for apeNEXT. The latter
is based on the freely available lcc compiler [2] and supports most
of the ANSI 89 standard with a few language extensions required for a
parallel machine. For machine-specific optimizations at the assembly
level, e.g. address arithmetic and register move operations, the
software package sofan is under development. Finally, the microcode
generator (shaker) optimizes instruction scheduling, which for APE
machines is done completely by software.

Stable prototype versions are available for all parts of the compiler
software and have already been used to benchmark the apeNEXT design.
For this purpose we considered various typical linear algebra
operations, such as the product of two complex vectors. This
operation is essentially limited by the memory bandwidth, implying a
maximum sustained performance of 50%. VHDL simulations that include
all machine details gave an efficiency of 41%. Even higher
performance rates can be achieved for operations requiring more
floating point operations per memory access, such as multiplying
arrays of SU(3) matrices, which achieves an efficiency of 65%. In QCD
simulations most of the time is spent applying the Dirac operator,
e.g. the Wilson-Dirac operator M = 1 − κH. We therefore investigated
the operation Hψ, for which a sustained performance of 59% has been
measured. This figure is made possible by extensive use of the
pre-fetch features of the processor and by keeping a local copy of
the gauge fields to save network bandwidth. Even for the smallest
local lattices complete overlap of floating point operations and
network communication is possible, so the time the processor spends
waiting for data is close to zero.
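The 50% bound for the vector product can be reproduced from the machine parameters. The sketch assumes that each normal operation of the vector product needs two complex operands streamed from memory while the accumulator stays in a register; this reading of the benchmark is an assumption, and the names are illustrative.

```python
# Bandwidth-limited efficiency bound for streaming kernels.
CLOCK_HZ = 200e6       # clock frequency from the text
MEM_BANDWIDTH = 3.2e9  # bytes per second, from Table 1
COMPLEX_BYTES = 16     # one double precision complex number


def max_efficiency(complex_words_per_op):
    """Upper bound on sustained/peak performance.

    `complex_words_per_op` is the number of complex words that must be
    loaded from memory for each normal operation a*b + c.
    """
    words_per_cycle = MEM_BANDWIDTH / CLOCK_HZ / COMPLEX_BYTES
    return min(1.0, words_per_cycle / complex_words_per_op)


# Vector product: accumulate a[i]*b[i], i.e. two loads per operation
# (assumed access pattern, accumulator kept in a register).
bound = max_efficiency(2)
print(f"bandwidth bound: {bound:.0%}")
print(f"simulated 41% corresponds to {0.41 / bound:.0%} of the bound")
```

The memory interface delivers one complex word per cycle, so a kernel needing two words per normal operation cannot exceed half of peak; kernels with more arithmetic per loaded word, such as SU(3) matrix products, are less constrained.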

4. apeNEXT PC PROJECT 

While pursuing the aim of building a custom-designed multi-teraflop
computer, the APE collaboration started activities to develop a fast
network, also based on LVDS, for interconnecting PCs. The final
network interface is planned to consist of two bi-directional links
with a bandwidth of 400 MByte per second each. Presently, a test
setup with two PCs is running stably using prototype interfaces with
one link each and a bandwidth of 180 MByte per second. For this
setup, running a QCD solver code, the sustained network bandwidth was
found to be 77 MByte per second. A similar setup using the final
network interfaces is expected to come into operation in
September 2002.

5. OUTLOOK AND CONCLUSIONS 

The hardware design of the next generation of APE custom-built
computers has been completed. While prototype boards and a backplane
have been available since the end of 2001, a prototype apeNEXT
processor is expected to be ready by the end of 2002. A larger
prototype installation is planned to be running by mid-2003. A stable
prototype version exists for all parts of the compiler software.
Based on this software we were able to demonstrate that key lattice
gauge theory operations will be able to run at a sustained
performance of O(50%) or more.